{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Hackathon 8\n",
    "\n",
    "Written by Eleanor Quint\n",
    "\n",
    "Topics:\n",
    "- Reinforcement Learning\n",
    "- Proximal Policy Optimization\n",
    "\n",
    "This is all setup in a IPython notebook so you can run any code you want to experiment with. Feel free to edit any cell, or add some to run your own code."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# We'll start with our library imports...\n",
    "from __future__ import print_function\n",
    "\n",
    "import numpy as np                 # to use numpy arrays\n",
    "import tensorflow as tf            # to specify and run computation graphs\n",
    "import tensorflow_datasets as tfds # to load training data\n",
    "import matplotlib.pyplot as plt    # to visualize data and draw plots\n",
    "from tqdm import tqdm              # to track progress of loops\n",
    "import gym                         # to setup and run RL environments\n",
    "import scipy.signal                # for a specific convolution function"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Reinforcement Learning\n",
    "\n",
    "Rather than training a deep model to classify, we're going to try to teach one how to act optimally in an environment. Thus, rather than training with a particular dataset, we're going to create an environment with which to interact. The model will take in the environment state and choose an action to take. We want to train the model to choose the best action in any given state.\n",
    "\n",
    "We'll use OpenAI Gym to create and run the environment we'll be interacting with. The default here is [Cartpole](https://gym.openai.com/envs/CartPole-v1/), but feel free to look at and experiment with others."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "env = gym.make('CartPole-v1')\n",
    "NUM_ACTIONS = env.action_space.n\n",
    "OBS_SHAPE = env.observation_space.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We're going to use an approach to optimization called \"policy iteration\". This is where we instantiate a random policy (parameterized by a deep neural network) and optimize it in steps. The basic algorithm in this approach is called \"REINFORCE\" In each step we'll collect experiences from the environment for one episode (e.g., in Cartpole, from the initial state of the simulation until the pole falls too far). After the episode we'll calculate how well the policy did and update it to perform better next time.\n",
    "\n",
    "The family of algorithms we're going to use is a type of policy iteration called \"Actor-critic\". So called because we're going to create an actor network (the policy) and a critic (the value network). The actor will take some actions in the environment and the critic will evaluate a baseline for how good those actions should be. If the actions turn out better than the baseline, the actions are reinforced. Otherwise, if the actions are worse than expected, they are reduced in probability."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "network_fn = lambda shape: tf.keras.Sequential([\n",
    "                            tf.keras.layers.Dense(256, activation=tf.nn.tanh),\n",
    "                            tf.keras.layers.Dense(256, activation=tf.nn.tanh),\n",
    "                            tf.keras.layers.Dense(shape)])\n",
    "\n",
    "# We'll declare our two networks\n",
    "policy_network = network_fn(NUM_ACTIONS)\n",
    "value_network = network_fn(1)\n",
    "\n",
    "# Reset the environment to start a new episode\n",
    "state = env.reset()\n",
    "input_ = np.expand_dims(state, 0)\n",
    "# Evaluate the first state\n",
    "print(\"Policy action logits:\", policy_network(input_))\n",
    "print(\"Estimated Value:\", value_network(input_))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Proximal Policy Optimization (PPO)\n",
    "\n",
    "More specifically, we're going to use an algorithm called \"Proximal Policy Optimization\". The key element this algorithm adds to the actor-critic family of algorithms is a limit to how much it can change the policy in one round of training. This ensures that the policy remains stable and gets better more or less monotonically. The mechanism used to limit the change is the KL divergence between the distribution over actions before and after training."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "VALUE_FN_ITERS = 80\n",
    "POLICY_FN_ITERS = 80\n",
    "KL_MARGIN = 1.2\n",
    "KL_TARGET = 0.01\n",
    "CLIP_RATIO = 0.2\n",
    "\n",
    "optimizer = tf.keras.optimizers.Adam()\n",
    "\n",
    "def discount_cumsum(x, discount):\n",
    "    \"\"\"\n",
    "    magic from the rllab library for computing discounted cumulative sums of vectors.\n",
    "    input: \n",
    "        vector x, \n",
    "        [x0, \n",
    "         x1, \n",
    "         x2]\n",
    "    output:\n",
    "        [x0 + discount * x1 + discount^2 * x2,  \n",
    "         x1 + discount * x2,\n",
    "         x2]\n",
    "    \"\"\"\n",
    "    return scipy.signal.lfilter([1], [1, float(-discount)], x[::-1],\n",
    "                                axis=0)[::-1]\n",
    "\n",
    "def categorical_kl(logp0, logp1):\n",
    "    \"\"\"Returns average kl divergence between two batches of distributions\"\"\"\n",
    "    all_kls = tf.reduce_sum(tf.exp(logp1) * (logp1 - logp0), axis=1)\n",
    "    return tf.reduce_mean(all_kls)\n",
    "\n",
    "def update_fn(policy_network, value_network, states, actions, rewards, gamma=0.99):\n",
    "    # Calculate the difference between how the actor did and how the critic expected it to do\n",
    "    # We call this difference, \"advantage\"\n",
    "    vals = np.squeeze(value_network(states))\n",
    "    deltas = rewards[:-1] + gamma * vals[1:] - vals[:-1]\n",
    "    advantage = discount_cumsum(deltas, gamma)\n",
    "\n",
    "    # Calculate the action probabilities before any updates\n",
    "    action_logits = policy_network(states)\n",
    "    initial_all_logp = tf.nn.log_softmax(action_logits)\n",
    "    row_indices = tf.range(initial_all_logp.shape[0])\n",
    "    indices = tf.transpose([row_indices, actions])\n",
    "    initial_action_logp = tf.gather_nd(initial_all_logp, indices)\n",
    "\n",
    "    # policy loss\n",
    "    for _ in range(POLICY_FN_ITERS):\n",
    "        with tf.GradientTape() as tape:\n",
    "            # get the policy's action probabilities\n",
    "            action_logits = policy_network(states)\n",
    "            all_logp = tf.nn.log_softmax(action_logits)\n",
    "            \n",
    "            row_indices = tf.range(all_logp.shape[0])\n",
    "            indices = tf.transpose([row_indices, actions])\n",
    "            action_logp = tf.gather_nd(all_logp, indices)\n",
    "\n",
    "            # decide how much to reinforce\n",
    "            ratio = tf.exp(action_logp - tf.stop_gradient(initial_action_logp))\n",
    "            min_adv = tf.where(advantage > 0.,\n",
    "                               tf.cast((1.+CLIP_RATIO)*advantage, tf.float32), \n",
    "                               tf.cast((1.-CLIP_RATIO)*advantage, tf.float32)\n",
    "                               )\n",
    "            surr_adv = tf.reduce_mean(tf.minimum(ratio[:-1] * advantage, min_adv))\n",
    "            pi_objective = surr_adv\n",
    "            pi_loss = -1. * pi_objective\n",
    "\n",
    "        # update policy\n",
    "        grads = tape.gradient(pi_loss, policy_network.trainable_variables)\n",
    "        optimizer.apply_gradients(zip(grads, policy_network.trainable_variables))\n",
    "\n",
    "        # figure out how much the policy has changed for early stopping\n",
    "        kl = categorical_kl(all_logp, initial_all_logp)\n",
    "        if kl > KL_MARGIN * KL_TARGET:\n",
    "            break\n",
    "\n",
    "    # value loss\n",
    "    # supervised training for a fixed number of iterations\n",
    "    returns = discount_cumsum(rewards, gamma)[:-1]\n",
    "    for _ in range(VALUE_FN_ITERS):\n",
    "        with tf.GradientTape() as tape:\n",
    "            vals = value_network(states)[:-1]\n",
    "            val_loss = (vals - returns)**2\n",
    "        # update value function\n",
    "        grads = tape.gradient(val_loss, value_network.trainable_variables)\n",
    "        optimizer.apply_gradients(zip(grads, value_network.trainable_variables))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we'll put this together with collecting experiences in the environment. Below is code for training the policy for one episode. It will choose actions randomly according to the probabilities chosen by the policy network and then, once the episode is done, it will train using those experiences."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "### CODE FOR 1 EPISODE OF TRAINING\n",
    "state_buffer = []\n",
    "action_buffer = []\n",
    "reward_buffer = []\n",
    "\n",
    "done = False\n",
    "state = env.reset()\n",
    "i = 0\n",
    "# Collect experiences from a one episode rollout\n",
    "while not done:\n",
    "    # store the initial state (and every state thereafter)\n",
    "    state_buffer.append(state)\n",
    "\n",
    "    # choose action\n",
    "    action_logits = policy_network(np.expand_dims(state, 0))\n",
    "    action = np.squeeze(tf.random.categorical(action_logits, 1))\n",
    "    action_buffer.append(action)\n",
    "\n",
    "    # step environment\n",
    "    state, rew, done, info = env.step(action)\n",
    "    reward_buffer.append(rew)\n",
    "\n",
    "# Run training update\n",
    "states = np.stack(state_buffer)\n",
    "actions = np.array(action_buffer)\n",
    "rewards = np.array(reward_buffer)\n",
    "update_fn(policy_network, value_network, states, actions, rewards)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Homework\n",
    "\n",
    "Write code to extend training to multiple episodes. Then, plot the mean entropy of the action distribution using the following function over the course of 25 episodes as well as the length of each episode. If the agent works well, the episodes should get longer. RL can be very random, so don't be surprised if you get a very good or bad outcome in your first try."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def categorical_entropy(logits):\n",
    "    return -1 * tf.reduce_mean(tf.reduce_sum(tf.nn.softmax(logits) * tf.nn.log_softmax(logits), axis=-1))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "dl_notebooks",
   "language": "python",
   "name": "dl_notebooks"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
